I. INTRODUCTION AND SUMMARY

Variable frame rate (VFR) transmission algorithms are an attractive approach for
voice transmission on a packet-switched network such as the ARPANET. They
lower the total number of bits transmitted in a nonuniform manner, which a packet
network can exploit by sending other messages during times when few bits
are needed for speech. However, since VFR algorithms require information from
many frames to properly determine the parameters to be used for synthesis of one
frame, they introduce new problems with transmission and selection rules for the
transmitter and receiver. This report details the problems we have
detected and describes the methods incorporated in the CHI implementation of the
VFR LPC-II network speech system to deal with them.

Good quality speech reproduction in a network speech compression system
depends on more than the inherent qualities of the vocoder. It is also
necessary that the receiver be able to supply the synthesizer with parameters at a
fairly constant rate, without gaps exceeding the backlog of data already prepared
for output. Providing a steady supply of parameters is complicated by the
requirement for breaking the continuous stream of parcels of parameters into
discrete messages for transmission over the ARPANET. These separate messages
may take different amounts of time to reach their destination, may be lost or
arrive out of order, and may contain different numbers of parcels. Further
complicating the problem of providing continuous speech output is the need to
minimize the total delay in the system in order to permit interactive
conversations. Chapter II presents a more detailed discussion of the issues involved
in maintaining speech quality on the ARPANET and describes the protocols used
for network voice communication as background for the implementation discussion.

Much  of  the  responsibility  for  quality  network  speech  communication  rests 
with  the  transmitter.  By  appropriately  selecting  the  threshold  for  parameter 
transmission  in  the  LPC-II  system  it  can  increase  the  speech  quality  or  lower 
the  transmission  rate  to  adjust  for  varying  network  performance.  By  decreasing 
the  number  of  parcels  per  message,  the  transmitter  can  cut  the  delays  due  to 
message  loading  time  and  transmission  bit  rates  at  the  expense  of  additional 
overhead. 

The transmitter must also take into account the relationship between
variable frame coding and message boundaries. In particular, any time parcels
are not transmitted, whether due to silence detection or speaker switching,
the transmitter must send all parameters for the first parcel after the gap and
should attempt to send all parameters in the last parcel before a gap. A more
detailed explanation of the transmitter section of the Culler/Harrison network
voice system is provided in Chapter III.

The receiver's function is to make use of the information received over the
ARPANET to provide as smooth an output as possible, preserving the audio fidelity,
without unnecessarily increasing the delay. One problem faced is how to estimate
the minimum delay needed before beginning to output a speech segment
in order to be sure all parcels of parameters will arrive in time to permit
synthesis of speech with no gaps. An approach developed for this estimation is
to maintain a continually updated estimate of the time a message will need to
traverse the net, and then to add a further delay to allow for variations in
network performance. This approach does not tie the output time of a speech
segment to the arrival time of any particular message.

A second area of concern for the receiver is proper interpretation of the
variable frame rate parameter information to produce valid parameters for each
frame. In particular, care must be taken to avoid interpolation between
parameters from different speakers or across gaps if high fidelity is to be maintained.
The final chapter describes the CHI receiver implementation, providing more
detail about the processing of messages in a variable frame rate system.


II. NETWORK SPEECH COMMUNICATION

Variable Frame Rate Speech Compression

The variable frame rate transmission algorithm is a method of further
reducing the average transmission rate required for digital voice communication.
It does this by not transmitting selected parameters describing a frame of
speech when these parameters do not differ significantly from the last set
transmitted. This converts a fixed frame rate system, where new parameters
are computed every 9.6 milliseconds, to a variable rate system in which
parameters are updated only when they have changed significantly. Actually,
separate decisions are made for transmission of pitch, gain and reflection
coefficient parameters, so the VFR system varies the number of bits per frame
while maintaining a fixed frame rate. Three bits are transmitted with each
frame to indicate which parameters are included. The variable frame rate
algorithm decreases the number of bits transmitted during continuous speech
from 47 to an average of less than 20 per frame.
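
As a rough check on these figures, the per-frame bit counts can be converted to
bit rates using the 9.6 millisecond frame period. The sketch below is
illustrative only: the function name is ours, and the counts ignore message
headers and leaders.

```python
FRAME_MS = 9.6  # frame period given in the text


def bit_rate(bits_per_frame):
    """Average bit rate in bits/second for a given number of bits per frame."""
    return bits_per_frame * 1000.0 / FRAME_MS


fixed_rate = bit_rate(47)  # every parameter sent in every frame
vfr_rate = bit_rate(20)    # upper end of the quoted VFR average

print(f"fixed: {fixed_rate:.0f} bps, VFR: under {vfr_rate:.0f} bps")
```

Under these assumptions the fixed-rate figure is a little under 4.9 kbps, and
the VFR average during continuous speech falls below roughly 2.1 kbps.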

Transmission rates are further reduced in a packet network by recognizing
periods of silence when the gain parameter stays below a selected threshold.
No parameters or frame information are transmitted during these silence periods,
leaving the entire network bandwidth available for other traffic. In a typical
two-way conversation, each speaker will be transmitting less than half the time
because of silence detection. The Network Voice Conference system provides
for additional reduction in network traffic by restricting speech transmission
to the current speaker.

Maintaining  Speech  Quality 

The quality of speech obtained in a network speech compression system
depends on a number of factors. Cohen has identified three factors: the
acoustic quality or fidelity of the reproduction, its continuity or smoothness,
and the delay between the original speech and its reproduction at the
destination [1]. The problem for a system is to maintain high fidelity and continuity
of the reproduced speech while minimizing the end-to-end delay and not exceeding
the desired transmission rate.

The question of fidelity of the reproduced speech has been examined at
some length in the previous technical report [2]. To obtain continuity or
smoothness in the reproduced speech, we must be able to synthesize the frames of
data at a rate sufficient to provide a steady audio signal output. The problem
here is not the computation time required, which is very little in modern signal
processors; it is assuring the availability of a steady stream of parameters
to control the synthesis. When the effects of the network are removed, as when
the transmitter and receiver are located at the same site, the LPC-II system
vocoder provides satisfactory acoustic fidelity while considerably reducing the
average transmission rate. In the ARPANET, several parcels, each holding the
parameters for one frame, are transmitted as one network message which arrives
as a unit. Essentially all messages eventually reach their destination. There is
considerable variation, however, in the time a message may take to reach its
destination.

If the receiver has output all the data generated from the parcels in one message
before the following message has arrived, a discontinuity will occur in the
speech. To avoid this problem, the receiver can wait before starting to output
data generated after a silence or when a new speaker begins. The delay must be
equal to the maximum deviation of any message's travel time from that of the
first message. This time can only be estimated, of course. If the estimate is
too low, a break will occur in the output. The delay is also subject to
considerable variation as network performance changes.

When uncontrolled network messages are used to gain increased throughput,
the variations can cause additional problems. Messages may arrive out of order
or even be lost completely. If enough delay is not included to allow for
reordering at the receiver, glitches are created because of missing parameters
from the late message. With variable frame rate algorithms, frames at the
beginning of the following message may be unusable because they need parameters
from frames in the late message. Hence the loss or excessive delay of one
message may make it impossible to correctly synthesize a longer stretch of speech
than the message itself represents.

The third dimension of the quality of a speech compression system is the
amount of delay which accumulates in the end-to-end transmission from speaker
to listener. This delay is apparent to a person using the system as the time
he must wait to receive a response to anything he says. It is most disruptive
in uncontrolled, full duplex conversations when two people may begin speaking
and neither will realize that the other is talking for this delay time. If
the delay is more than 1/2 to 1 second, considerable adjustment is required,
including careful alternation of speakers and explicit indication by each of
when they are done talking. In controlled conference situations, where only
one speaker at a time is permitted by the protocols, the problems are less severe,
but delays of over a second are annoying.

The total delay from source to destination includes several parts. The
vocoder analysis requires several frames of data to compute the parameters for
each frame, introducing a delay of 40 to 50 milliseconds. The packing of many
frames of data into each message for network transmission adds a delay equal
to the total amount of speech represented by the message. As the efficiency
of the vocoder improves in minimizing the average number of bits per frame, the
number of frames which can be packed into one fixed length message increases.
For example, a single packet uncontrolled speech data message can have up to
938 data bits. This could represent from 180 milliseconds to over 3 seconds
worth of speech.
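
The two extremes quoted above can be reproduced from the parcel sizes given
later in this chapter (3 bits for an empty parcel, 50 bits for a full one). The
following is a sketch under the assumption that all 938 data bits are available
for parcels and that fractional parcels are counted; the names are ours.

```python
FRAME_MS = 9.6           # speech represented by one parcel
MESSAGE_DATA_BITS = 938  # single packet uncontrolled message
MIN_PARCEL_BITS = 3      # presence bits only
MAX_PARCEL_BITS = 50     # presence bits plus all parameters


def speech_ms(parcel_bits, data_bits=MESSAGE_DATA_BITS):
    """Milliseconds of speech carried when every parcel has the given size."""
    return data_bits / parcel_bits * FRAME_MS


shortest = speech_ms(MAX_PARCEL_BITS)  # every parcel full
longest = speech_ms(MIN_PARCEL_BITS)   # every parcel empty

print(f"{shortest:.0f} ms to {longest / 1000:.1f} s of speech per message")
```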

The time a message spends in the network, traveling from source to
destination, is an additional delay. It is assumed this delay increases for
longer messages, but probably not proportionately. A lower bound on the
network time can be obtained by counting the number of IMPs through which a
message must pass along its shortest possible path and adding the times for
the message to be serially transferred into each. For 1000-bit messages and
50 Kbps lines this time is 20 milliseconds/IMP [3]. For a relatively short
path from CHI to ISI involving five intermediate IMPs, this adds at least 100
milliseconds to the delay. Observed times for short control messages indicate
that about 80 milliseconds must be added to this for minimum delays due to IMP
queuing and processing and transfer to local HOST processors.

The total minimum delay for a message is the sum of these individual
delays and totals 220 milliseconds plus the amount of speech in the message.
However, as observed in the discussion of maintaining continuity of speech,
an additional delay must be added to account for the variation in arrival
time for messages. By examining the factors that make up this variation,
we see that it includes the variation in the amount of speech packed into
each message and variations in the network performance. The first factor
can be controlled by limiting the number of parcels per message to produce
a fairly fixed delay due to message packing. If exactly 18 parcels are included
in each message, we obtain a minimum delay of about 400 milliseconds. We must
still add sufficient delay for the greatest expected deviation in network
performance. If this addition is under 100 milliseconds we will have a total
delay which is probably acceptable. Lower network performance or longer
transmission paths (CHI to LINCOLN is 10 IMPs) produce
uncomfortably long delays.
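
The delay budget above can be tallied as follows. This is a sketch, not a
measurement: it takes the lower analysis figure (40 milliseconds), and it
assumes the 80 millisecond IMP overhead observed on the CHI to ISI path also
applies to the longer path. Function and parameter names are ours.

```python
FRAME_MS = 9.6  # speech represented by one parcel


def serial_ms_per_imp(message_bits=1000, line_bps=50_000):
    """Time to clock one message serially into an IMP, in milliseconds."""
    return message_bits * 1000.0 / line_bps


def minimum_delay_ms(parcels, imps=5, analysis_ms=40.0, imp_overhead_ms=80.0):
    """Minimum end-to-end delay, excluding any allowance for variation."""
    network = imps * serial_ms_per_imp()  # 20 ms per IMP on 50 Kbps lines
    packing = parcels * FRAME_MS          # speech held back to fill a message
    return analysis_ms + network + imp_overhead_ms + packing


chi_to_isi = minimum_delay_ms(18)               # five intermediate IMPs
chi_to_lincoln = minimum_delay_ms(18, imps=10)  # ten IMPs

print(f"CHI-ISI {chi_to_isi:.0f} ms, CHI-LINCOLN {chi_to_lincoln:.0f} ms")
```

With 18 parcels per message the five-IMP path comes to roughly 390
milliseconds, in line with the figure of about 400 quoted above; the ten-IMP
path adds a further 100 milliseconds before any allowance for variation.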

Network  Voice  Protocols 


The efforts to provide good quality digital voice communication on the
ARPANET have resulted in the development of protocols to standardize
the form of this communication and to ensure that enough information is
provided to the receiver to enable it to accurately reconstruct the speech
waveform. These protocols provide for control messages which permit
establishment of network voice communication, agreement on the form of vocoding and
control of speakers when necessary. We will not review these commands in this
report since they are not the area of concern here. We will concentrate on
the rules and conventions for data transfer.

Each data message, in addition to a number of parcels of frame parameters,
carries the standard HOST/IMP leader and a 32-bit network voice protocol (NVP) header.
The leader contains the HOST-ID of the destination or source and the LINK
number specified by the receiving host for data messages. A separate LINK is
used for control messages. The HOST-ID may be used by the receiver to
discriminate between messages arriving from different speakers in a voice conference.
The NVP header contains five fields:

TIME - a 16-bit time stamp, giving the parcel number of the first parcel
in the message.

SP - a 1-bit skipped-parcels flag, set when parcels immediately preceding
the message were not transmitted.

PC - a 7-bit parcel count of the number of parcels in the message.

ST - a 1-bit indicator of whether this data is part of the primary or
secondary data stream during a conference.

EXT - a 7-bit extension identifying the individual speaker whose speech is being
transmitted.

The extension and stream information are combined with the HOST-ID to provide
complete speaker identification and permit up to two simultaneous speakers
talking to subsets of the total group under proposed extensions to the conference
protocols. The time stamp and parcel count enable the receiver to properly order
the messages as received and detect delayed or missing messages. They are also
valuable in determining the variations in network performance in order to
establish the appropriate delay before synthesis. The skipped-parcels flag provides
an explicit indicator of the beginning of a speech burst and may be used for
initiation of the synthesis delay.
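
The five fields fill the 32-bit header exactly (16 + 1 + 7 + 1 + 7). A minimal
sketch of packing and unpacking them follows. The bit ordering used here, with
the fields laid out most significant first in the order listed above, is an
assumption made for illustration; the protocol specification should be
consulted for the actual layout. The function names are ours.

```python
# Field names and widths as listed in the text; the ordering is assumed.
FIELDS = (("time", 16), ("sp", 1), ("pc", 7), ("st", 1), ("ext", 7))


def pack_header(**values):
    """Pack the five NVP header fields into one 32-bit integer."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_header(word):
    """Recover the field values from a 32-bit header word."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


header = pack_header(time=1000, sp=1, pc=18, st=0, ext=5)
print(unpack_header(header))
```

Note that the 7-bit parcel count allows up to 127 parcels per message, well
above the limits the transmitter actually applies.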

The parcels field in a data message contains a variable number of parcels,
each representing one speech frame. With variable frame rate transmission the
size of each parcel may vary. Each parcel begins with three "presence bits"
which are one if the corresponding parameter type is present for that parcel
and zero if it is not. If all three bits are zero the parcel is only three
bits long. If all bits are on, the parcel is 50 bits long.

The variable frame rate transmission algorithm requires only a limited
amount of agreement between the transmitter and receiver. Gain or reflection
coefficients are transmitted whenever their distance from the previous ones
transmitted exceeds a threshold. The receiver uses linear interpolation to
fill in missing values if there is less than 100 milliseconds to the next
occurrence of the parameter. Otherwise, or if a voicing change intervenes,
the old values are used. Interpolation is used to avoid sharp transitions to
the next transmitted values. Pitch is transmitted each time it changes, and
no interpolation is used at the receiver.
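
The receiver's fill-in rule for gain or a reflection coefficient can be
sketched as below. Two points are assumptions on our part rather than
statements from the protocol: the 100 millisecond limit is measured across the
whole gap between two transmitted values, and a voicing change anywhere in that
gap disables interpolation. The function name and the use of None for "not
transmitted" are ours.

```python
FRAME_MS = 9.6
MAX_INTERP_MS = 100.0


def fill_missing(values, voiced):
    """Fill untransmitted (None) entries of one parameter track.

    values: per-frame value, or None when the parameter was not sent.
    voiced: per-frame voicing decision (bool).
    """
    if values[0] is None:
        raise ValueError("the first frame must carry the parameter")
    out = list(values)
    n = len(values)
    prev = 0  # index of the most recent transmitted value
    i = 1
    while i < n:
        if values[i] is not None:
            prev = i
            i += 1
            continue
        nxt = i  # find the next transmitted value, if any
        while nxt < n and values[nxt] is None:
            nxt += 1
        interpolate = (
            nxt < n
            and (nxt - prev) * FRAME_MS < MAX_INTERP_MS
            and len(set(voiced[prev:nxt + 1])) == 1  # no voicing change
        )
        for j in range(i, nxt):
            if interpolate:
                frac = (j - prev) / (nxt - prev)
                out[j] = values[prev] + frac * (values[nxt] - values[prev])
            else:
                out[j] = values[prev]  # hold the old value
        i = nxt
    return out


print(fill_missing([0.0, None, None, None, 4.0], [True] * 5))
```

At 9.6 milliseconds per frame the 100 millisecond limit corresponds to a gap of
at most ten frames, so a receiver following this rule never needs to look more
than ten parcels ahead.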

Explicit silence detection is provided for by defining a silence threshold,
measured in the same units as gain, and a time-before-silence interval. When
the transmitter detects that the calculated gain has been below the threshold
value for the specified amount of time, it ceases transmitting parcels,
discarding all but the most recent ones. When the gain once again exceeds the
threshold, transmission is resumed beginning with the parcels saved. The
skipped-parcels bit is set in the NVP header of the first message after silence.

The receiver has no explicit indicator when the transmitter has declared
silence. It is necessary to infer silence from the absence of further messages.
The skipped-parcels bit in a message assures the receiver that this message is
the first after silence, rather than being out of order, and can be used to
start the delay after silence.


III. LPC-II TRANSMITTER SYSTEM


The transmitter's function is to take an input speech signal and produce
output data messages which are sent over the ARPANET. It must provide the
information needed by the receiver to reproduce the speech while attempting to
minimize both the bit rate and the delay in delivering the message. The
transmitter performs its function in a number of separate steps, each of which is
implemented as a separate process within the total system; Figure 1 illustrates
the data flow through these processes.

Data  Input 

The input signal may come from one of four voice terminals which can be
connected to the CHI system. The selection of which terminal to use is under
control of the Local Conference Controller (LCC) program, which acts on control
messages from the conference chairman (CHAIR). The signal is low pass filtered
and sampled at 6.67 kHz by an analog-to-digital converter. The data is stored
in an input data buffer until one frame of data (64 points or 9.6 milliseconds)
has been collected. This buffer is then queued for processing.

LPC  Processing 

The computational parts of the linear predictive coding of the speech data
are performed in a separate processor, the AP-90. The analysis calculation is
performed at a fixed frame rate of about 104 frames/second. The AP computes
the reflection coefficients (Ks), gain, and pitch and voicing estimates. It also
performs the likelihood ratio test to determine if the reflection coefficients
should be transmitted. If the coefficients are to be transmitted, it computes
the autocorrelation of the predictor coefficients to use in future tests. The
result of this test is reported to the MP along with the Ks, gain, pitch and
voicing.

Parcel  Processing 

The output of the AP analysis is processed by ANFOST. The pitch and voicing
estimates are refined to produce the pitch parameter. The pitch, gain and Ks
are then encoded in separate words and the VFR distance tests are performed to
determine which parameters will be transmitted. The 3-bit frame header is
constructed at this time and this header is stored with the encoded parameters in
the output list. The header code is used to index a table of bit lengths for
the frame and the entry is added to the bit count for the output list.

A silence test is performed by comparing the gain of the frame to a threshold
value; if the gain is below the threshold for a specified number of consecutive
frames, silence is declared. If silence has not been declared, the bit count
for the list is tested to see if there will be room for at least one more parcel.
If not, or if the number of parcels has reached a prescribed upper limit, the
message packing process is initiated. The two tests are needed because of the
wide variation in the number of parcels that will fit in one message. The maximum
number of parcels per message is limited to 41. These limits are absolute
upper bounds. Normally, a lower limit on the number of parcels is tested when
network performance is such that there is no backlog of messages to send. This
lower value is usually from 15 to 25 parcels per message, giving message loading
times of 145 to 240 milliseconds.
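
The packing decision described above amounts to three tests, sketched below.
The sketch assumes the 938-bit single packet limit from Chapter II is the bit
budget for parcels and uses 18 as an example of the lower limit; the function
and constant names are ours, not those of the CHI implementation.

```python
MAX_DATA_BITS = 938     # single packet message capacity (assumed parcel budget)
MAX_PARCEL_BITS = 50    # largest possible parcel
HARD_PARCEL_LIMIT = 41  # absolute upper bound on parcels per message
FRAME_MS = 9.6


def should_pack(bit_count, parcel_count, backlog, soft_limit=18):
    """Decide whether the output list should be packed into a message now."""
    if bit_count + MAX_PARCEL_BITS > MAX_DATA_BITS:
        return True  # no room left for a worst-case parcel
    if parcel_count >= HARD_PARCEL_LIMIT:
        return True  # absolute parcel limit reached
    # The lower limit applies only when no earlier message is still waiting.
    return (not backlog) and parcel_count >= soft_limit


def loading_ms(parcels):
    """Message loading time for a given parcel count."""
    return parcels * FRAME_MS


print(should_pack(300, 18, backlog=False), loading_ms(15), loading_ms(25))
```

When the network falls behind, the lower limit is bypassed and messages grow
toward the absolute limits, trading delay for fewer, fuller messages.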

If silence is declared, the last parcel's header is forced to seven,
causing all parameters to be transmitted. The message packing process is
initiated to send whatever parcels are in the output list.

Once silence has been declared, the gain of each parcel is tested to see if it
exceeds the silence threshold. As long as the gain remains below the threshold,
silence is continued. The output list is allowed to build up to eight parcels.
Once it has reached eight parcels, the oldest parcel is discarded and the bit
count is adjusted each time a new parcel is stored. When the silence threshold
is exceeded, silence is terminated. If the output list had reached eight
parcels, the skipped-parcels flag is set for the message packing process. The
output list is allowed to continue growing to normal message sizes.
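
The silence handling of the last three paragraphs can be sketched as a small
state machine. This is an illustration under stated assumptions, not the CHI
code: parcels are represented as dictionaries with a "header" entry, a gain
equal to the threshold is treated as silent, and the bit count bookkeeping is
omitted. The class and attribute names are ours.

```python
class SilenceGate:
    """Tracks silence and the list of parcels awaiting message packing."""

    def __init__(self, threshold, frames_before_silence, keep=8):
        self.threshold = threshold
        self.frames_before_silence = frames_before_silence
        self.keep = keep      # parcels retained during silence
        self.low_run = 0      # consecutive low-gain frames seen
        self.silent = False
        self.skipped = False  # skipped-parcels flag for the next message
        self.output = []      # pending parcels

    def push(self, gain, parcel):
        """Store one parcel; return parcels to send now, or None."""
        loud = gain > self.threshold
        if self.silent:
            if loud:
                # Parcels were dropped only if the list had filled up.
                self.skipped = len(self.output) == self.keep
                self.silent = False
            elif len(self.output) == self.keep:
                self.output.pop(0)  # discard the oldest saved parcel
            self.output.append(parcel)
            return None
        self.output.append(parcel)
        self.low_run = 0 if loud else self.low_run + 1
        if self.low_run < self.frames_before_silence:
            return None
        # Silence declared: send everything, last parcel with all parameters.
        self.silent = True
        self.low_run = 0
        parcel["header"] = 7
        flushed, self.output = self.output, []
        return flushed


gate = SilenceGate(threshold=10, frames_before_silence=3)
gains = [50, 50, 1, 1, 1]
sent = [gate.push(g, {"n": n, "header": 0}) for n, g in enumerate(gains)]
print(len(sent[-1]), sent[-1][-1]["header"])
```

When the flag is set, the message packing process is then responsible for
forcing the first parcel of the next message to carry all parameters.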

Message  Preparation 

Message packing is a separate process which normally is scheduled only when
all copies of previously prepared messages have been delivered to the network.
As noted above, if enough data is present to completely fill a message, the
packing process will be entered directly. This allows longer messages when
network performance is very poor, thereby making more efficient use of the
network.

If there is no local speaker, the output list is cleared without preparing
a message. Otherwise, the frame number of the first parcel, the parcel count,
skipped-parcels flag and speaker's extension are merged to form the NVP header
for a new data message. If the skipped-parcels flag is set, the header code for
the first parcel in the message is set to seven, ensuring that all parameters
are present. The coded parameters and code for each parcel are then packed
into the new message using the code to determine which parameters are to be
included. The new message is placed on an output message queue to await
transmission.

Message  Transmission 

The message transmission process is scheduled when the output message queue
is not empty. It uses a list of HOST-IDs and LINKs provided by the conference
CHAIR to generate the HOST/IMP leaders. One copy of the message is sent to the
network through a separate input/output processor each time this process is run.
When the last HOST entry is used the message is removed from the output queue
and discarded.

Since present protocols require transmission of one copy of each data
message to each HOST in a network voice conference, the time to deliver these
copies to the network becomes a factor in the total delay. This is particularly
true at HOSTs such as CHI which have VDH connections to their IMP. The serial
transfer time over our VDH connection adds about 20 milliseconds to the delay
for each HOST the message is sent to. As a small measure in reducing this
delay, we short-circuit the normal transfer of messages to the IMP if their
destination is our own HOST. Messages for our HOST are delivered directly to
our input message processor, eliminating both the network delay for the local
copy and the addition of 20 milliseconds of delay to copies to other HOSTs.

Conference  Control 


The flow of messages from our transmitter is gated by the local conference
controller, which acts under orders from the CHAIR to allow message preparation
and determine which voice terminal is connected to the vocoder. It is also
necessary for the LCC to initialize the output list when a local speaker is selected.
The skipped-parcels flag is set. This provides information for the receiver to
adjust its delay for the new speaker and guarantees that all parameters will be
provided for the first parcel in the initial message.

When a speaker's turn is ended by a message from the CHAIR, no further
data messages can be sent to the network. Hence it is not possible to ensure
that the last parcel sent will have all parameters present, which would guarantee
that parameter interpolation could be performed by the receiver where needed.
This situation must be handled by the receiver.

Control  of  the  Transmitter 


There are several parameters which are available to adjust the performance
of the transmitter. Proper adjustment of these parameters can help compensate
for poor network response and noisy environments or take advantage of good
performance to get better speech quality. Controls are provided in the CHI network
voice system to permit control of these parameters dynamically from a system
keyboard:

a. The silence threshold and the number of frames below this threshold
before silence is declared can be changed. Raising the threshold allows
operation in a noisier environment without background noise causing transmission when
the speaker is silent.

b.  The  test  value  used  for  the  likelihood  ratio  test  for  transmission  of 
reflection  coefficients  can  be  raised  to  decrease  the  number  of  bits  transmitted 
or  lowered  to  obtain  better  quality. 

c.  The  minimum  number  of  parcels  to  be  packed  into  each  message  can  be 
lowered  to  shorten  delays  due  to  message  loading  or  raised  to  make  more  efficient 
use  of  the  network. 

In  addition,  the  bit  count  for  each  message  is  accumulated  and  read  each 
second,  then  smoothed  to  provide  a transmission  rate  indicator.  This  rate  is 
displayed  as  a plot  against  time  continuously  and  can  be  printed  at  the  terminal 
on  request. 


IV. LPC-II RECEIVER


The receiver's function is to accept data messages arriving from the
transmitter over the ARPANET and generate a synthetic audio signal which is
played out to local listeners at their voice terminals. The receiver must
attempt to produce a continuous output signal despite the partitioning of parcels into
messages and the variations in the time these messages take to reach it from
the transmitter. It must properly interpret the variable frame rate data to
provide the parameters actually used for synthesis. It must recognize silent
intervals when no messages are received. In meeting these requirements, the
receiver attempts to add as little delay as possible to the end-to-end delays
in the network voice communication.

The  receiver  is  part  of  the  same  system  as  the  transmitter,  and  must  share 
processor  and  memory  resources  with  it.  It  is  also  made  up  of  a number  of 
separate,  cooperating  processes,  as  illustrated  in  figure  2. 

Data  Message  Input 

Each message arriving from the network is classified as data only if
its IMP/HOST leader shows that it arrived on the specified data LINK.
Only those data messages from the current speaker, as shown by their HOST-ID,
STREAM and EXTENSION fields, are retained. All other data messages are
discarded. Messages from the current speaker are inserted in an input queue for
the LPC synthesizer. The messages are kept in time stamp order, even though
they are usually sent as uncontrolled network messages and may arrive out of
order.
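
The filtering and ordering step can be sketched as follows. Messages are
modelled here as dictionaries with hypothetical keys ("host", "st", "ext",
"time"), and wraparound of the 16-bit time stamp is ignored for simplicity;
neither detail is taken from the CHI implementation.

```python
import bisect


class InputQueue:
    """Holds data messages from the current speaker in time stamp order."""

    def __init__(self, speaker):
        self.speaker = speaker  # (HOST-ID, stream, extension)
        self.times = []         # sorted time stamps, parallel to messages
        self.messages = []

    def offer(self, msg):
        """Insert msg if it is from the current speaker; report acceptance."""
        if (msg["host"], msg["st"], msg["ext"]) != self.speaker:
            return False        # not the current speaker: discard
        i = bisect.bisect_right(self.times, msg["time"])
        self.times.insert(i, msg["time"])
        self.messages.insert(i, msg)
        return True


queue = InputQueue(speaker=(11, 0, 5))
for stamp in (40, 20, 60):      # messages arriving out of order
    queue.offer({"host": 11, "st": 0, "ext": 5, "time": stamp})
print(queue.times)
```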

Message  Selection 

An attempt is made to select a new message for synthesis whenever there is
not enough information available in the previous message to prepare a set of
parameters for the synthesizer. Since not all parameters are transmitted in
each parcel, and linear interpolation is the preferred method for filling in
missing parameters, there may be up to ten parcels of parameters left in the
current message when a new message is needed. (If the next ten parcels do not
have a given parameter, the previous value is used rather than interpolating.)

Figure 2: Receiver Data Flow

In order to allow for variations in network performance or the number of
parcels in a message, both of which can affect the time between messages, it is
necessary to establish a buffering interval, or delay, before processing
the first message after a break in transmission because of silence or a change
in speaker. The length of the delay should be sufficient to assure that a
message will be available when needed.

It is possible to choose a delay equal to the sum of the expected variation
in the network performance, the maximum difference in parcels per message and
an allowance for the need to interpolate using values in the next message to
determine parameters for parcels in the previous message. This delay can be
added to the arrival time of the first message to give the time when that message
is to be used. The disadvantage of this approach is that the maximum expected
variation must be used for the network performance factor, since it is not known
whether the first message arrived quickly or slowly. Also, since the variation
in arrival time of the first message is reflected in the time when the message
is played out, the length of silence periods is not well preserved.

An alternative method for determining the delay before playing out a
message has been developed which permits some reduction in the delay and in the
variation in silence intervals. An expected network travel time (NT) is computed
by smoothing the difference between the arrival time of a message and the time
it was sent (Time Stamp + Parcel Count). This time actually includes the
difference in time frames of the transmitter and receiver. NT is added to the
time stamp of each message to give the expected arrival time of a message with
no parcels. To this time we add a delay (D) which is the sum of the maximum
expected number of parcels per message (25 to 40), the number of parcels from one
message which may depend on the next (10) and the expected variation about the
network time. This gives the time when the message should be processed:

Time = Time Stamp + NT + D.

The value of NT is adjusted each time a message is selected for synthesis, using
the observed time for that message. The delay time D could be adjusted to
reflect network performance variations. At present, the CHI system permits
setting the value for D dynamically but otherwise holds it fixed.
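
The computation can be sketched as follows. This is a minimal illustration
under stated assumptions, not the CHI implementation: the report does not
give the smoothing method, so simple exponential smoothing with a
hypothetical weight alpha is assumed; all times are assumed to be measured
in parcel intervals; and the class name, method names and the jitter value
are invented for the example.

```python
class PlayoutClock:
    """Tracks the network travel time NT and computes playout times."""

    def __init__(self, max_parcels=40, lookahead=10, jitter=5, alpha=0.125):
        # D = maximum parcels per message
        #   + parcels that may depend on the next message
        #   + expected variation about the network time (assumed value).
        self.d = max_parcels + lookahead + jitter
        self.alpha = alpha  # smoothing weight (assumption)
        self.nt = None      # no estimate until the first observation

    def observe(self, arrival, time_stamp, parcel_count):
        """Update NT when a message is selected for synthesis."""
        # Observed travel time: arrival time minus the time the message
        # was sent, where the send time is Time Stamp + Parcel Count.
        sample = arrival - (time_stamp + parcel_count)
        if self.nt is None:
            self.nt = sample
        else:
            self.nt += self.alpha * (sample - self.nt)
        return self.nt

    def playout_time(self, time_stamp):
        """Time = Time Stamp + NT + D."""
        return time_stamp + self.nt + self.d
```

Because NT is a difference between receiver and transmitter clock readings,
any constant offset between the two clocks is absorbed into NT and needs no
separate correction.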

If the first message on the queue contains parcels immediately following
those in the previous message, it is selected for synthesis without delay. If
there are no messages on the queue, then either the speaker's transmitter has
stopped sending messages or the message is delayed or lost. As long as there
is a backlog of frames of synthesized data waiting to be played out, more
parameters are not needed and no special processing is required. When the
backlog of frames for output is exhausted (and the last frame is being played
out), any remaining parcels are used to prepare parameters for synthesis of
frames, with missing parameters filled in by using the most recent values that
have been received. When all parcels have been used and all frames are output,
it is assumed that the transmitter is silent, and all-zero frames are played out.

If the first message on the input queue has a time stamp greater than
expected, it may have arrived out of order, messages may have been lost, or it
may be the first message after silence. In any case, the desired playout time
is computed from its time stamp, NT and D. If the time has not arrived, the
message is not processed. While waiting for the playout time, any remaining
parcels from the previous message are used or all-zero frames are played out, as
described above.

If the variations in network performance exceed those expected, a message
may arrive out of order and delayed enough that the following message has
already been processed. In this case, the time stamp of the new message will
be less than that of the expected message. The new message must be discarded.
A diagnostic is printed at the terminal in this case.
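
The selection cases described above can be summarized in a short sketch.
It is illustrative only: the function and enumeration names are
hypothetical, and it assumes that the expected time stamp is the previous
message's time stamp plus its parcel count, with all times in parcel
intervals.

```python
from enum import Enum


class Action(Enum):
    SELECT = "select"          # use the message for synthesis now
    WAIT = "wait"              # playout time has not yet arrived
    DISCARD = "discard"        # too late: its successor was already used
    NO_MESSAGE = "no message"  # queue empty: drain backlog, then silence


def choose_action(head_stamp, expected_stamp, now, nt, d):
    """Decide what to do with the first message on the input queue.

    head_stamp is None when the queue is empty.
    """
    if head_stamp is None:
        return Action.NO_MESSAGE
    if head_stamp == expected_stamp:
        # Parcels immediately follow the previous message: no delay.
        return Action.SELECT
    if head_stamp < expected_stamp:
        # Arrived after a later message was processed.
        return Action.DISCARD
    # Gap in time stamps: out of order, lost messages, or first after
    # silence. Wait until Time Stamp + NT + D.
    playout = head_stamp + nt + d
    return Action.SELECT if now >= playout else Action.WAIT
```

In the WAIT and NO_MESSAGE cases the receiver continues with residual
parcels from the previous message or with all-zero frames.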

Once a message is selected, its parcels are unpacked into individual
parameters and parcel header codes. Each parcel occupies a fixed length buffer
within a parcel list, with any residual parcels from the previous message at the
top of the list.

Parcel  Processing  and  Parameter  Preparation 

Parameters are prepared for synthesis processing one frame at a time. The
array processor performs the parameter interpolation and synthesis filtering.
It maintains the beginning of frame parameters from the previous frame. The
main processor provides a set of interpolation weights and transmitted
parameters. The array processor then interpolates between these parameters and
the beginning of frame parameters to obtain the end of frame parameters.

The main processor (MP) locates the next set of transmitted parameters for
each of pitch, gain and reflection coefficients (Ks). Locating the parameters
and determining the proper interpolation coefficients is a three-step process.
First, the header codes of all parcels in the buffer are OR'ed. If the result
is seven, or if there are ten or more parcels left, there is enough information
to determine a set of parameters. If not, an attempt is made to get more
parcels from the next message as described earlier. Next, the list of parcels
is searched for the first occurrence of each type of parameter. The number of
parcels passed by is used to look up the interpolation coefficient. If this
distance is greater than nine, the coefficient is 0, causing the old value to
be used. If the distance is 0, the coefficient is 1, and the new value will be
used. The third step prevents interpolation between voiced and unvoiced
parcels. The list of parcels is again searched, but this time starting with
the parcel containing the parameter and working backwards to the first
occurrence of a pitch parameter. If this pitch does not have the same voicing
as the beginning of frame parameters, the interpolation value is set to 0
unless it is 1. The pitch parameter is never interpolated, since it is
transmitted whenever it changes.
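
The three steps can be sketched as follows. Several details here are
assumptions made for illustration and are not taken from the CHI code: the
assignment of header-code bits to parameter types, the representation of a
parcel as a dictionary, the convention that a pitch of 0 means unvoiced, and
the table value 1/(distance + 1) for distances of one to nine (the report
gives only the end points of the table).

```python
PITCH, GAIN, KS = 1, 2, 4      # hypothetical header-code bits
ALL_TYPES = PITCH | GAIN | KS  # OR of all three is seven
MAX_LOOKAHEAD = 10


def enough_information(parcels):
    """Step 1: OR the header codes of all parcels in the buffer."""
    combined = 0
    for parcel in parcels:
        combined |= parcel["header"]
    return combined == ALL_TYPES or len(parcels) >= MAX_LOOKAHEAD


def coefficient(parcels, kind, frame_voiced):
    """Steps 2 and 3: interpolation coefficient for one parameter type."""
    # Step 2: count parcels passed before the first one carrying `kind`.
    distance = next(
        (i for i, p in enumerate(parcels) if p["header"] & kind), None
    )
    if distance is None or distance > 9:
        return 0.0  # too far away: keep the old value
    if kind == PITCH:
        # Pitch is never interpolated: new value now, or old value.
        return 1.0 if distance == 0 else 0.0
    weight = 1.0 / (distance + 1)  # assumed linear table; 1 at distance 0
    # Step 3: search backwards from that parcel for the nearest pitch.
    for parcel in reversed(parcels[: distance + 1]):
        if parcel["header"] & PITCH:
            voiced = parcel["pitch"] > 0  # assumption: 0 means unvoiced
            if voiced != frame_voiced and weight != 1.0:
                weight = 0.0  # do not interpolate across a voicing change
            break
    return weight
```

With a coefficient w, the array processor would form the end of frame value
as old + w * (new - old), so w = 0 keeps the old value and w = 1 takes the
new one.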

Effect  of  Missing  Parcels 

When there are missing parcels, as occurs during silence or when speakers
change, the parameters remaining in the AP will not be related to the first
parcel parameters of the next message. The old parameters are cleared by sending
an extra set of parameters to the AP for synthesis when silence is recognized.
These parameters are all 0 with interpolation values of 1. The data generated
in this case gives the final transition into silence. When the new parameters
arrive, the first frame synthesized will provide a proper transition out of
silence.

If the first parcel in the first message does not contain all parameters,
the receiver will not be able to process it. These parcels can be replaced
by additional silence until all parameters have been found. This problem will
not normally arise if all transmitters force the first parcel to be complete
whenever the skipped parcels bit is set in the message header. It can still
occur if messages are lost or very late.